Change schema_infer_max_rec config to use Option<usize> rather than usize #13250
Thank you @alihan-synnada
There are cases where we need to know if the user explicitly set the schema_infer_max_rec config.
It seems like checking for 0 would be sufficient to see if they set the option, as 0 is not a valid value for schema inference (you can't infer a schema with no records) 🤔
Using an Option seems fine to me as it is more consistent, but I do think we should keep the CSV and JSON options consistent
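To make the trade-off between a 0 sentinel and an Option concrete, here is a minimal sketch. It is not the DataFusion definition: the struct below is a hypothetical stand-in holding only the field under discussion, the helper names user_set_max_rec and effective_max_rec are invented for illustration, and the fallback of 100 is assumed from the old JSON default shown in the diff.

```rust
/// Hypothetical stand-in for the real options struct (illustration only).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct JsonOptions {
    /// `None` means the user never set a value.
    pub schema_infer_max_rec: Option<usize>,
}

/// Assumed fallback, taken from the previous `default = 100` in the diff.
pub const DEFAULT_SCHEMA_INFER_MAX_REC: usize = 100;

impl JsonOptions {
    /// The setter keeps taking a plain `usize` and wraps it in `Some`.
    pub fn with_schema_infer_max_rec(mut self, max_rec: usize) -> Self {
        self.schema_infer_max_rec = Some(max_rec);
        self
    }

    /// Hypothetical helper: did the user set the option explicitly?
    pub fn user_set_max_rec(&self) -> bool {
        self.schema_infer_max_rec.is_some()
    }

    /// Hypothetical helper: the explicit value, or the assumed default.
    pub fn effective_max_rec(&self) -> usize {
        self.schema_infer_max_rec
            .unwrap_or(DEFAULT_SCHEMA_INFER_MAX_REC)
    }
}
```

With this shape, JsonOptions::default() reports the option as unset and falls back to the default, while an explicit value (even 0) stays distinguishable, so no number has to be reserved as a sentinel.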
@@ -1773,7 +1773,7 @@ config_namespace! {
     /// Options controlling JSON format
     pub struct JsonOptions {
         pub compression: CompressionTypeVariant, default = CompressionTypeVariant::UNCOMPRESSED
-        pub schema_infer_max_rec: usize, default = 100
+        pub schema_infer_max_rec: Option<usize>, default = None
If we make this change to be Option then it is inconsistent with the JSON options:
datafusion/datafusion/common/src/config.rs, lines 1775 to 1776 in 2d7892b
    pub compression: CompressionTypeVariant, default = CompressionTypeVariant::UNCOMPRESSED
    pub schema_infer_max_rec: usize, default = 100
I think it would be best to keep the two options consistent
I did change both CsvOptions::schema_infer_max_rec and JsonOptions::schema_infer_max_rec so they are consistent. I think the PR body was confusing because of the 2nd change (lifetimes) mentioning CSV. I'll rephrase the PR.
ah I see -- I was clearly confused
Thank you @alihan-synnada -- this makes sense to me and is consistent with other options
LGTM also, thank you for the review @alamb
Which issue does this PR close?
No related issue
Rationale for this change
I came across the following obstacles trying to write some custom schema inference code:
1. There are cases where we need to know if the user explicitly set the schema_infer_max_rec config.
2. read_to_delimited_chunks_from_stream does not necessarily have to be BoxStream<'static>
What changes are included in this PR?
CsvOptions::schema_infer_max_rec and JsonOptions::schema_infer_max_rec are now Options
Are these changes tested?
Tested with existing code and tests.
Are there any user-facing changes?
1. CsvOptions::schema_infer_max_rec and JsonOptions::schema_infer_max_rec are now Option<usize>s as opposed to usize (with_schema_infer_max_rec signature is unchanged)
2. read_to_delimited_chunks, read_to_delimited_chunks_from_stream, convert_stream, convert_to_compress_stream now take lifetime parameters
I believe (1) is a breaking change but I'm not sure about (2)
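As a rough illustration of why a lifetime parameter is less restrictive than 'static, here is a std-only analogy. It does not use the functions named above or the futures crate: a boxed iterator stands in for BoxStream (which, in the futures crate, is an alias for a pinned, boxed dyn Stream with a lifetime bound), and the type alias BoxIter and the helper delimited_chunks are hypothetical names made up for this sketch.

```rust
/// Analogy only: a boxed iterator plays the role of a boxed stream.
type BoxIter<'a, T> = Box<dyn Iterator<Item = T> + 'a>;

/// Hypothetical helper: splits borrowed input pieces on '\n' and drops
/// empty lines. Because the bound is `'a` rather than `'static`, the
/// input may borrow from data owned by the caller.
fn delimited_chunks<'a>(input: BoxIter<'a, &'a str>) -> BoxIter<'a, String> {
    Box::new(
        input
            .flat_map(|piece| piece.split('\n'))
            .filter(|line| !line.is_empty())
            .map(str::to_owned),
    )
}
```

A caller can build the input from a local String, for example delimited_chunks(Box::new(vec![owned.as_str()].into_iter())). If the parameter were BoxIter<'static, &'static str> instead, that call would be rejected because the borrow of the local String does not live for 'static. Existing callers that already pass 'static data would still type-check in this analogy, which may be relevant to the question about (2), though that would need to be confirmed against the real signatures.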